Machine Learning Methods

Evaluating Personal Job Market Prospects in 2024

Author

Anu Sharma, Cindy Guzman, Gavin Boss

Published

October 11, 2025

1 Overview

The analysis examines trends in Business Analytics, Data Science, and Machine Learning job postings, with a focus on the skills required for these roles. The study evaluates how varying skill combinations influence salary levels, remote work availability, and career progression pathways.

This analysis employs three main approaches: (1) KMeans clustering to segment jobs based on skill requirements, (2) regression models to predict salary based on skills and experience, and (3) a classification model that separates ML/Data Science roles from Business Analytics and other positions. The models use 25 technical skills as features along with experience and remote work indicators. Results show that experience is the dominant salary driver, jobs cluster into 6 groups with different compensation and remote work patterns, and ML/DS roles have clearly identifiable skill signatures.

2 Data Loading and Setup

The analysis starts by loading the Lightcast job postings dataset and identifying relevant skill columns. The dataset contains comprehensive information about job postings including titles, salaries, required skills, and other job characteristics.

Code
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import json
import re
from collections import Counter

pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"

# Load data from csv
df = pd.read_csv("data/lightcast_job_postings.csv", low_memory=False)
print(f"Dataset loaded: {len(df):,} rows, {len(df.columns)} columns")

# print(df.head())
Dataset loaded: 72,498 rows, 131 columns

2.1 Important Skills columns

The dataset contains multiple skill-related columns. After examining the schema, the columns ‘SKILLS_NAME’, ‘SOFTWARE_SKILLS_NAME’ and ‘SPECIALIZED_SKILLS_NAME’ provide the most detailed skill information for this analysis. These columns list the specific technical skills mentioned in each job posting.

3 Skills Data Preprocessing

The next step involves filtering the data to include only records with valid salary and title information. Then, binary features are created for 25 key technical skills covering ML, Data Science, and Business Analytics domains to enable machine learning analysis.

Code
# Apply filters; copy so the column assignment below acts on an independent frame
df_filtered = df.dropna(subset=['SALARY', 'TITLE']).copy()

# Convert salary to numeric and filter
df_filtered['SALARY'] = pd.to_numeric(df_filtered['SALARY'], errors='coerce')
df_filtered = df_filtered[df_filtered['SALARY'] > 0]

print(f"Records after filtering: {len(df_filtered):,}")

df_skills = df_filtered.copy()

# Focus on key Business Analytics/ML/Data Science skills. Key skills for
# BA/ML/DS roles identified manually.
key_skills =  [
        'Python (Programming Language)',
        'R (Programming Language)',
        'SQL (Programming Language)',
        'Machine Learning',
        'Data Science',
        'Data Analysis',
        'Statistics',
        'Artificial Intelligence',
        'TensorFlow',
        'PyTorch (Machine Learning Library)',
        'Pandas (Python Package)',
        'NumPy (Python Package)',
        'Scikit-Learn (Python Package)',
        'Big Data',
        'Apache Spark',
        'Apache Hadoop',
        'Amazon Web Services',
        'Microsoft Azure',
        'Google Cloud Platform (Gcp)',
        'Data Visualization',
        'Tableau (Business Intelligence Software)',
        'Power BI',
        'Natural Language Processing (NLP)',
        'Computer Vision',
        'Deep Learning'
    ]

print(f"Using focused {len(key_skills)} BA/ML/DS technical skills for analysis")

# Create binary features for each key skill.
for skill in key_skills:
    # Clean skill name for column naming
    # Eg: R (Programming Language) --> has_r_programming_language
    skill_col_name = f'has_{skill.lower().replace(" ", "_").replace("-", "_").replace("(", "").replace(")", "")}'


    df_skills[skill_col_name] = (
        df_skills['SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False) |
        df_skills['SOFTWARE_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False) |
        df_skills['SPECIALIZED_SKILLS_NAME'].str.contains(skill, case=False, na=False, regex=False)
    ).astype(int)

print("Binary skill features created")

# Create ML/DS role indicator using focused skills
core_ml_skills = [
    'has_machine_learning', 'has_artificial_intelligence', 'has_tensorflow', 'has_pytorch_machine_learning_library',
    'has_deep_learning', 'has_natural_language_processing_nlp', 'has_computer_vision'
]

core_ds_skills = [
    'has_python_programming_language', 'has_r_programming_language', 'has_statistics',
    'has_data_science', 'has_pandas_python_package', 'has_numpy_python_package',
    'has_scikit_learn_python_package', 'has_big_data'
]

core_ba_skills = [
    'has_data_analysis', 'has_data_visualization', 'has_sql_programming_language',
    'has_tableau_business_intelligence_software', 'has_power_bi'
]

# Role indicators
# ML roles are straightforward.
df_skills['is_ml_role'] = (
    (df_skills[core_ml_skills].sum(axis=1) > 0)
).astype(int)

# R is primarily associated with the Data Science field, so a job counts as a
# DS role if it requires R or lists more than one data science skill.
# The comparison needs its own parentheses: `|` binds tighter than `==`.
df_skills['is_ds_role'] = (
    (df_skills['has_r_programming_language'] == 1) | (df_skills[core_ds_skills].sum(axis=1) > 1)
).astype(int)

# Business Analytics roles typically require SQL, visualization tools (Tableau, Power BI)
# and data analysis capabilities. If a job has two or more BA skills, consider it a BA role.
df_skills['is_ba_role'] = (
    df_skills[core_ba_skills].sum(axis=1) >= 2
).astype(int)

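# Remote work indicator. This assumes REMOTE_TYPE is a numeric 0/1 flag; if it is a
# multi-valued code, map it to a binary column before averaging it as a share.
df_skills['is_remote'] = df_skills['REMOTE_TYPE'].fillna(0).astype(int)
# Postings with no stated minimum experience are treated as 0 years.
df_skills['experience_years'] = df_skills['MIN_YEARS_EXPERIENCE'].fillna(0)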

df_final = df_skills
print(f"Final dataset size: {len(df_final):,}")
print(f"ML roles identified: {df_final['is_ml_role'].sum():,}")
print(f"Data Science roles identified: {df_final['is_ds_role'].sum():,}")
print(f"Business Analytics roles identified: {df_final['is_ba_role'].sum():,}")
print(f"BA/ML/DS combined: {((df_final['is_ml_role'] == 1) | (df_final['is_ds_role'] == 1) | (df_final['is_ba_role'] == 1)).sum():,}")
Records after filtering: 30,808
Using focused 25 BA/ML/DS technical skills for analysis
Binary skill features created
Final dataset size: 30,808
ML roles identified: 3,226
Data Science roles identified: 2,877
Business Analytics roles identified: 10,831
BA/ML/DS combined: 12,821

For each of the 25 key skills, a binary indicator variable is created (1 if the skill is mentioned, 0 otherwise). This transforms the text skill data into numerical features suitable for machine learning models.
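The indicator construction can be sketched on a toy frame. The three postings below are hypothetical, and `skill_to_column` is an illustrative helper name rather than part of the original code; the matching call mirrors the `str.contains` usage above.

```python
import pandas as pd


def skill_to_column(skill: str) -> str:
    """Turn a skill label into a has_* column name (hypothetical helper)."""
    cleaned = skill.lower()
    for old, new in ((" ", "_"), ("-", "_"), ("(", ""), (")", "")):
        cleaned = cleaned.replace(old, new)
    return f"has_{cleaned}"


# Hypothetical postings; the real columns hold many more skills per row.
toy = pd.DataFrame({
    "SKILLS_NAME": ["SQL (Programming Language), Data Analysis", None, None],
    "SOFTWARE_SKILLS_NAME": [None, "Power BI", "PyTorch (Machine Learning Library)"],
})

for skill in ["SQL (Programming Language)", "Power BI", "Machine Learning"]:
    mask = pd.Series(False, index=toy.index)
    for col in ["SKILLS_NAME", "SOFTWARE_SKILLS_NAME"]:
        # Case-insensitive literal substring match; missing values count as "no".
        mask |= toy[col].str.contains(skill, case=False, na=False, regex=False)
    toy[skill_to_column(skill)] = mask.astype(int)

print(toy.filter(like="has_"))
```

Because the match is a substring test, the third posting is flagged for Machine Learning even though it only lists the PyTorch entry, whose label contains the phrase. Whether that overlap is desirable depends on how the skill strings are delimited in the real data.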

3.1 Role Classification Logic

Three role categories are identified based on technical skills:

  • ML roles: Require at least one ML/AI skill (Machine Learning, Artificial Intelligence, TensorFlow, PyTorch, Deep Learning, NLP, Computer Vision)
  • Data Science roles: Require R programming, or at least two of the data science skills (Python, R, Statistics, Data Science, Pandas, NumPy, Scikit-learn, Big Data)
  • Business Analytics roles: Require at least two of SQL, Data Analysis, Data Visualization, Tableau, and Power BI

The analysis examines how these specialized skills impact salary and career opportunities. Machine learning models are used to find patterns that can guide job seekers in choosing which skills to develop.
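As a minimal sketch of the Data Science rule, the toy frame below uses three of the `has_*` columns with made-up values. It also shows why the comparison and the count test each need their own parentheses: in Python, `|` is evaluated before `==`.

```python
import pandas as pd

# Made-up indicator values for three postings.
toy = pd.DataFrame({
    "has_r_programming_language":      [0, 0, 1],
    "has_python_programming_language": [1, 1, 0],
    "has_statistics":                  [1, 0, 0],
})
ds_cols = list(toy.columns)

# "More than one data science skill"
multi_tool = toy[ds_cols].sum(axis=1) > 1

# Intended rule: requires R, OR lists more than one DS skill.
is_ds = ((toy["has_r_programming_language"] == 1) | multi_tool).astype(int)
print(is_ds.tolist())

# Operator precedence on plain scalars: `|` binds tighter than `==`.
r_flag, multi = 0, True
with_parens = (r_flag == 1) | multi   # (False) | True
without_parens = r_flag == 1 | multi  # 0 == (1 | True), i.e. 0 == 1
print(with_parens, without_parens)
```

The first posting qualifies through the skill count and the third through R alone. The scalar example shows the two groupings disagreeing on the same inputs.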

4 Feature Engineering for ML

Before building models, the dataset is prepared by selecting relevant columns. This includes the salary (target variable), skill indicators, remote work status, and experience years.

Code
# Just prepare the modeling dataset
modeling_cols = ['SALARY', 'is_ml_role', 'is_ds_role', 'is_ba_role', 'is_remote', 'experience_years'] + \
            [col for col in df_final.columns if col.startswith('has_')]

df_modeling = df_final[modeling_cols].copy()

print("Features for modeling:")
print(f"Dataset shape: {df_modeling.shape}")
print(f"Columns: {list(df_modeling.columns)}")
print(f"Missing values: {df_modeling.isnull().sum().sum()}")
Features for modeling:
Dataset shape: (30808, 31)
Columns: ['SALARY', 'is_ml_role', 'is_ds_role', 'is_ba_role', 'is_remote', 'experience_years', 'has_python_programming_language', 'has_r_programming_language', 'has_sql_programming_language', 'has_machine_learning', 'has_data_science', 'has_data_analysis', 'has_statistics', 'has_artificial_intelligence', 'has_tensorflow', 'has_pytorch_machine_learning_library', 'has_pandas_python_package', 'has_numpy_python_package', 'has_scikit_learn_python_package', 'has_big_data', 'has_apache_spark', 'has_apache_hadoop', 'has_amazon_web_services', 'has_microsoft_azure', 'has_google_cloud_platform_gcp', 'has_data_visualization', 'has_tableau_business_intelligence_software', 'has_power_bi', 'has_natural_language_processing_nlp', 'has_computer_vision', 'has_deep_learning']
Missing values: 0

The modeling dataset now contains binary skill features, experience, remote work indicator, and salary information. This structured format allows application of various machine learning techniques.

5 Unsupervised Learning: KMeans Clustering Based on Skills

The first machine learning approach uses KMeans clustering to discover natural groupings in the job market. This unsupervised technique groups jobs with similar skill profiles together, without using salary information. The goal is to see if jobs naturally segment into distinct categories based on their requirements.

Code
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, f1_score, confusion_matrix, classification_report

# Prepare features for clustering using skills and other features
skill_feature_cols = [col for col in df_modeling.columns if col.startswith('has_')]
print(f"Available skill features: {len(skill_feature_cols)}")

# Base clustering features
clustering_features = skill_feature_cols + ['experience_years', 'is_remote']

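# Encode ONET and NAICS6. Label encoding maps each code to an arbitrary integer,
# so KMeans treats the codes as ordered values; one-hot encoding is a common alternative.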
le_onet = LabelEncoder()
df_modeling['onet_encoded'] = le_onet.fit_transform(df_final['ONET'].fillna('Unknown'))
clustering_features.append('onet_encoded')

le_naics = LabelEncoder()
df_modeling['naics_encoded'] = le_naics.fit_transform(df_final['NAICS6'].fillna('Unknown'))
clustering_features.append('naics_encoded')

# Prepare clustering data
X_cluster = df_modeling[clustering_features].fillna(0)

# Scale features
scaler_cluster = StandardScaler()
X_cluster_scaled = scaler_cluster.fit_transform(X_cluster)

# KMeans clustering
kmeans = KMeans(n_clusters=6, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_cluster_scaled)
df_modeling['cluster'] = clusters

# print("Skills based clustering completed")
# print("Cluster centers:")
# for i, center in enumerate(kmeans.cluster_centers_):
#     print(f"Cluster {i}: {center}")
Available skill features: 25

The clustering model groups similar jobs together using skill patterns, experience requirements, and job characteristics. The algorithm assigns each job to one of 6 clusters. Now the characteristics of each cluster can be examined to understand what makes them distinct.
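The cluster count of 6 is set by hand. One common way to sanity-check such a choice, not part of the original analysis, is to compare silhouette scores across candidate values of k. The sketch below runs on synthetic blobs standing in for the scaled feature matrix; the data and the candidate range are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the feature matrix: three well-separated blobs.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(100, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

On the real data the same loop could be run over `X_cluster_scaled`. A flat or noisy score curve would suggest the clusters are not sharply separated.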

Code
# Analyze clustering.
cluster_summary = df_modeling.groupby('cluster').agg({
    'SALARY': ['count', 'mean'],
    'is_ml_role': 'mean',
    'is_ds_role': 'mean',
    'is_ba_role': 'mean',
    'is_remote': 'mean',
    'experience_years': 'mean'
}).round(2)

cluster_summary.columns = ['count', 'avg_salary', 'ml_role_pct', 'ds_role_pct', 'ba_role_pct',
                        'remote_percentage', 'avg_experience']
cluster_summary = cluster_summary.reset_index()

# Compute combined BA/ML/DS percentage on-the-fly
# A job has BA/ML/DS if it has any of the three role types
cluster_summary['ml_ds_ba_combined_pct'] = cluster_summary.apply(
    lambda row: ((df_modeling[df_modeling['cluster'] == row['cluster']][['is_ml_role', 'is_ds_role', 'is_ba_role']].sum(axis=1) > 0).mean()),
    axis=1
).round(2)

print("Skills based Cluster Summary:")
print(cluster_summary)

# Visualize cluster characteristics.
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=('Cluster Size', 'Average Salary', 'BA/ML/DS Role %',
                'Remote Work %', 'Avg Experience', 'Salary Distribution'),
    specs=[[{"type": "bar"}, {"type": "bar"}, {"type": "bar"}],
        [{"type": "bar"}, {"type": "bar"}, {"type": "scatter"}]]
)

fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['count'], name="Count"), row=1, col=1)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['avg_salary'], name="Avg Salary"), row=1, col=2)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ml_role_pct'], name="ML %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ds_role_pct'], name="DS %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['ba_role_pct'], name="BA %"), row=1, col=3)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['remote_percentage'], name="Remote %"), row=2, col=1)
fig.add_trace(go.Bar(x=cluster_summary['cluster'], y=cluster_summary['avg_experience'], name="Experience"), row=2, col=2)

# Salary distribution by cluster.
fig.add_trace(
    go.Scatter(
        x=df_modeling['cluster'],
        y=df_modeling['SALARY'],
        mode='markers',
        opacity=0.6,
        name="Jobs"
    ),
    row=2, col=3
)

fig.update_layout(
    height=650,
    showlegend=False,
    template="plotly_white",
    title={
        'text': "Skills-Based KMeans Clustering Results",
        'y': 0.98,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top'
    },
    margin=dict(t=80)
)
fig.show()
Skills based Cluster Summary:
   cluster  count  avg_salary  ml_role_pct  ds_role_pct  ba_role_pct  \
0        0    583   139707.42         0.60         0.26         0.70   
1        1  10189   144796.54         0.14         0.00         0.04   
2        2  13313   100969.83         0.01         0.01         0.28   
3        3   6573   108557.35         0.17         0.39         0.95   
4        4     77   140001.35         1.00         0.32         0.31   
5        5     73   117793.86         0.55         0.32         0.93   

   remote_percentage  avg_experience  ml_ds_ba_combined_pct  
0               0.44            4.45                   0.90  
1               0.25            7.80                   0.17  
2               0.39            2.00                   0.29  
3               0.48            3.27                   0.99  
4               0.34            4.23                   1.00  
5               0.56            3.01                   0.96  

5.1 Insights from KMeans Clustering

The clustering analysis grouped jobs based on their skill requirements and characteristics. With the number of clusters fixed at 6, the resulting groups differ in salary levels, remote work availability, and skill profiles.
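The per-cluster summary, including the combined BA/ML/DS share, can also be computed in a single grouped aggregation. This is a minimal sketch on made-up rows; the helper column `any_role` and the output column names are illustrative assumptions.

```python
import pandas as pd

# Made-up postings assigned to two clusters.
toy = pd.DataFrame({
    "cluster":    [0, 0, 1, 1, 1],
    "SALARY":     [100_000, 120_000, 90_000, 95_000, 85_000],
    "is_ml_role": [1, 0, 0, 0, 0],
    "is_ds_role": [0, 0, 0, 1, 0],
    "is_ba_role": [1, 0, 0, 1, 0],
})

# A posting counts toward the combined share if it carries any of the three labels.
role_cols = ["is_ml_role", "is_ds_role", "is_ba_role"]
toy["any_role"] = (toy[role_cols].sum(axis=1) > 0).astype(int)

# Named aggregation gives flat, readable column names directly.
summary = (
    toy.groupby("cluster")
    .agg(
        count=("SALARY", "size"),
        avg_salary=("SALARY", "mean"),
        ba_role_pct=("is_ba_role", "mean"),
        any_role_pct=("any_role", "mean"),
    )
    .round(2)
    .reset_index()
)
print(summary)
```

The mean of a 0/1 column is the share of rows with the flag set, which is how the role and remote percentages in the summary table are obtained.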

Key Findings:

  • Business Analytics dominates: 10,831 BA roles vs. 3,226 ML and 2,877 DS
  • Cluster 0 (583 jobs, $140K): High-skill hybrid (60% ML, 26% DS, 70% BA)
  • Cluster 1 (10,189 jobs, $145K): Mostly general tech, only 17% BA/DS/ML, highest pay
  • Cluster 2 (13,313 jobs, $101K): Entry-level, lowest experience (2 yrs), BA-focused (28%)
  • Cluster 3 (6,573 jobs, $109K): BA-heavy (95%) with DS overlap (39%)
  • Cluster 4 (77 jobs, $140K): Pure ML specialists (100% ML), niche but high-paying
  • Cluster 5 (73 jobs, $118K): Hybrid roles (96% BA/DS/ML), most remote-friendly (56%)
  • Remote work: 25%–56% across clusters
  • Experience: cluster averages range from 2.0 to 7.8 years, consistent with a progression from entry-level to senior roles

Takeaways for Job Seekers:

  • Most opportunities: Business Analytics (SQL, Tableau, Power BI, visualization)
  • Highest pay + volume: Cluster 1 ($145K, 10K+ jobs) — general tech roles
  • Entry path: Cluster 2 ($101K, 13K jobs) — BA-focused, lowest experience needed
  • BA-focused growth: Cluster 3 ($109K) — strong BA demand with DS hybrid edge
  • Specialist track: Cluster 4 ($140K) — pure ML, fewer jobs but high pay
  • Hybrid advantage: Cluster 0 ($140K) and Cluster 5 ($118K, 56% remote) — multi-skill roles with flexibility

6 Supervised Learning: Multiple Regression

The second approach uses supervised learning to predict salary based on skills and experience. Two regression models are trained: Linear Regression (assumes linear relationships) and Random Forest (captures complex non-linear patterns). This analysis identifies which skills and factors most strongly influence compensation.

Code
# Identify regression features.
# Focus on skills (not role labels) to understand how skills directly affect salary
regression_features = skill_feature_cols + ['experience_years', 'is_remote']

# Prepare regression data using salary as the target variable
X_reg = df_modeling[regression_features].fillna(0)
y_reg = df_modeling['SALARY']

X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

print(f"Training set size: {len(X_train):,}")
print(f"Test set size: {len(X_test):,}")

# Scale features
scaler_reg = StandardScaler()
X_train_scaled = scaler_reg.fit_transform(X_train)
X_test_scaled = scaler_reg.transform(X_test)

# Multiple Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)

# Random Forest Regression
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_train_scaled, y_train)

print("Skills based regression models training completed")
Training set size: 24,646
Test set size: 6,162
Skills based regression models training completed

Both models are trained on 80% of the data and will be evaluated on the remaining 20% test set. The Random Forest model can capture non-linear relationships and interactions between skills, while Linear Regression provides a baseline for comparison.

Code
# Evaluate regression models
# Linear Regression predictions
y_pred_lr = lr.predict(X_test_scaled)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

# Random Forest predictions
y_pred_rf = rf_reg.predict(X_test_scaled)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Skills-based Regression Model Performance:")
print(f"Linear Regression - RMSE: ${rmse_lr:,.2f}, R²: {r2_lr:.4f}")
print(f"Random Forest - RMSE: ${rmse_rf:,.2f}, R²: {r2_rf:.4f}")

# Feature importance for Random Forest
# Only use features that actually exist in the model
actual_feature_names = [col for col in regression_features if col in X_train.columns]
importances = rf_reg.feature_importances_

# Visualize feature importance
fig = px.bar(x=actual_feature_names, y=importances,
            title="Skills Impact on Salary (Random Forest Feature Importance)",
            labels={'x': 'Features', 'y': 'Importance'})
fig.update_layout(template="plotly_white", xaxis_tickangle=-45)
fig.show()

# Top skills by salary impact
skill_importance = list(zip(actual_feature_names, importances))
skill_importance.sort(key=lambda x: x[1], reverse=True)
print("\nTop skills by salary impact:")
for skill, importance in skill_importance[:10]:
    print(f"{skill}: {importance:.4f}")
Skills-based Regression Model Performance:
Linear Regression - RMSE: $37,899.01, R²: 0.2780
Random Forest - RMSE: $32,558.54, R²: 0.4672

Top skills by salary impact:
experience_years: 0.4932
is_remote: 0.0728
has_data_analysis: 0.0426
has_tableau_business_intelligence_software: 0.0372
has_amazon_web_services: 0.0361
has_sql_programming_language: 0.0350
has_statistics: 0.0302
has_python_programming_language: 0.0300
has_machine_learning: 0.0265
has_data_science: 0.0261

6.1 Regression Analysis: What drives salary?

Prediction models were built to understand how skills influence salary. The Random Forest model achieved an R² of 0.47 on the test set, compared with 0.28 for Linear Regression, which suggests that the skill-salary relationship involves non-linear effects and interactions that a linear model misses.

Model Performance:

  • Random Forest: R² = 0.47 (explains 47% of salary variation), RMSE = $32,559
  • Linear Regression: R² = 0.28, RMSE = $37,899
  • Insight: Skills, experience, and remote status together leave about half of salary variation unexplained, so other factors also matter.
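The two metrics above follow directly from their definitions. The sketch below computes both by hand on four made-up salaries, as a worked illustration of what RMSE and R² measure.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error: typical size of a prediction error, in dollars."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred) -> float:
    """Share of variance explained relative to always predicting the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)


# Made-up salaries and predictions.
y_true = [90_000, 110_000, 130_000, 150_000]
y_pred = [100_000, 100_000, 140_000, 140_000]

print(f"RMSE: ${rmse(y_true, y_pred):,.0f}")
print(f"R²:   {r_squared(y_true, y_pred):.2f}")

# Always predicting the mean salary gives R² = 0 by construction.
mean_only = [np.mean(y_true)] * len(y_true)
print(f"R² of mean-only baseline: {r_squared(y_true, mean_only):.2f}")
```

Every prediction here is off by $10,000, so the RMSE is $10,000, and the residual variation is one fifth of the total, giving an R² of 0.80.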

Key Salary Drivers (Feature Importance):

  1. Experience (0.49): Largest factor, accounting for nearly half of the model's total feature importance
  2. Remote work (0.07): Second-largest factor; importance measures predictive weight, not the direction of the effect
  3. Data Analysis (0.04): Core analytical capability
  4. Tableau (0.04): Visualization and BI tool
  5. AWS (0.04): Cloud computing platform
  6. SQL (0.04): Database querying and manipulation
  7. Statistics (0.03): Analytical foundation
  8. Python (0.03): Programming language

Career Implications:

  • Experience is critical: it is the strongest predictor of salary.
  • Remote status matters: it is the second most important predictor, although feature importance alone does not show whether remote roles pay more or less.
  • Skill combinations matter: technical, analytical, and cloud skills together shape salary outcomes.

Summary: Salary is not determined by skills alone. Experience and remote status are the strongest predictors, while technical skills provide additional differentiation.
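The importances reported above are the Random Forest's impurity-based scores. Permutation importance on held-out data is a common cross-check that was not part of the original analysis. The sketch below uses synthetic data; the feature names and the salary formula are assumptions chosen so that experience dominates.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 600

# Synthetic features: experience drives salary, one skill adds a little,
# and one flag is pure noise.
experience = rng.integers(0, 15, size=n)
has_skill = rng.integers(0, 2, size=n)
noise_flag = rng.integers(0, 2, size=n)
salary = 60_000 + 6_000 * experience + 8_000 * has_skill + rng.normal(0, 5_000, size=n)

X = np.column_stack([experience, has_skill, noise_flag])
names = ["experience_years", "has_skill", "noise_flag"]
X_tr, X_te, y_tr, y_te = train_test_split(X, salary, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in R².
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
ranked = sorted(zip(names, result.importances_mean), key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Neither importance measure indicates direction. To see whether a feature raises or lowers predicted salary, compare group means or inspect the linear model's coefficients.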

7 Supervised Learning: Classification to Identify ML/DS Roles

The third approach uses classification to distinguish ML/Data Science roles from Business Analytics and other positions. A Random Forest Classifier is trained to predict whether a job is an ML/DS role based on its skill requirements. This analysis reveals which skills are the strongest “signature” indicators that distinguish ML/DS positions from BA roles.

Code
# Prepare features for classification.
classification_features = skill_feature_cols + ['experience_years', 'is_remote']

# Prepare classification data
X_clf = df_modeling[classification_features].fillna(0)
# Target: ML/DS roles (computed from is_ml_role OR is_ds_role)
y_clf = ((df_modeling['is_ml_role'] == 1) | (df_modeling['is_ds_role'] == 1)).astype(int)

# Train/test split for classification
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)

# Scale features
scaler_clf = StandardScaler()
X_train_clf_scaled = scaler_clf.fit_transform(X_train_clf)
X_test_clf_scaled = scaler_clf.transform(X_test_clf)

# Random Forest Classification
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train_clf_scaled, y_train_clf)

print("Skills-based classification model trained successfully!")
Skills-based classification model trained successfully!

The classifier learns patterns that distinguish ML/DS roles from BA and other positions based on their skill profiles. The model is now evaluated to see how accurately it can identify these specialized ML/DS roles versus the more common BA positions.

Code
# Random Forest predictions
y_pred_rf_clf = rf_clf.predict(X_test_clf_scaled)
accuracy_rf = accuracy_score(y_test_clf, y_pred_rf_clf)
f1_rf = f1_score(y_test_clf, y_pred_rf_clf)

print("Skills based Classification Model Performance:")
print(f"Random Forest - Accuracy: {accuracy_rf:.4f}, F1 Score: {f1_rf:.4f}")

# Confusion Matrix for Random Forest
cm = confusion_matrix(y_test_clf, y_pred_rf_clf)

# Visualize confusion matrix
fig = px.imshow(cm, text_auto=True, aspect="auto",
                title="Confusion Matrix - ML/DS Role Classification",
                labels=dict(x="Predicted", y="Actual"),
                color_continuous_scale="Blues")

fig.update_layout(template="plotly_white")
fig.update_xaxes(tickvals=[0,1], ticktext=['Not ML/DS', 'ML/DS'])
fig.update_yaxes(tickvals=[0,1], ticktext=['Not ML/DS', 'ML/DS'])
fig.show()

print("Classification Report:")
print(classification_report(y_test_clf, y_pred_rf_clf))

# Only use features that actually exist in the classification model
clf_actual_feature_names = [col for col in classification_features if col in X_train_clf.columns]
clf_importances = rf_clf.feature_importances_

# Visualize classification feature importance
fig = px.bar(x=clf_actual_feature_names, y=clf_importances,
            title="Skills Impact on ML/Data Science Role Classification",
            labels={'x': 'Features', 'y': 'Importance'})
fig.update_layout(template="plotly_white", xaxis_tickangle=-45)
fig.show()
Skills based Classification Model Performance:
Random Forest - Accuracy: 0.9995, F1 Score: 0.9986
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5082
           1       1.00      1.00      1.00      1080

    accuracy                           1.00      6162
   macro avg       1.00      1.00      1.00      6162
weighted avg       1.00      1.00      1.00      6162

7.1 Classification Results: Identifying ML/Data Science Roles

The classification model predicts whether a job is an ML/Data Science role based on its skill requirements. The Random Forest Classifier achieved strong performance in distinguishing these specialized roles from Business Analytics and other positions.

Model Performance Interpretation:

  • Accuracy of 99.95% shows the model correctly labels nearly all test postings.
  • The ML/DS labels were themselves defined from the skill indicators used as features, so this near-perfect score largely reflects the model recovering the labeling rule rather than discovering an independent pattern.
  • Within that rule-based definition, the skill criteria separate ML/DS roles cleanly from the more common BA positions.

Feature Importance: This chart shows which skills are the strongest predictors of ML/DS classification. Skills with higher bars are the “signature” skills that distinguish ML/DS roles from BA and general analyst positions.

Actionable Insights:

  • ML/DS roles, as defined here, draw on a different skill set from BA roles
  • ML/DS roles focus on programming (Python, R), statistical modeling, and ML frameworks (TensorFlow, PyTorch)
  • BA roles focus on SQL, visualization tools (Tableau, Power BI), and data analysis
  • Highlighting ML-specific skills signals ML/DS capability and differentiates a profile from BA positions
  • Building expertise in the high-importance skills increases readiness for ML/DS roles
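The ML/DS target is computed from the same skill indicators that are passed to the classifier, which on its own makes a near-perfect score likely. The sketch below illustrates the effect on synthetic binary features. The labeling rule is a simplified, hypothetical stand-in for the role definitions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Five synthetic 0/1 skill indicators per posting.
X = rng.integers(0, 2, size=(2000, 5))

# Hypothetical rule: label is 1 if either of the first two skills is present,
# or if more than one of the remaining three is present.
y = ((X[:, :2].sum(axis=1) > 0) | (X[:, 2:].sum(axis=1) > 1)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)
accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"Accuracy on rule-derived labels: {accuracy:.4f}")
```

A more informative test of role signatures would take the label from an independent source, such as job titles or occupation codes, and keep the skill indicators as features only.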

8 Model Results Visualization

This section provides a consolidated view of all three modeling approaches. The comparison shows how different models perform on their respective tasks and highlights the most impactful skills across different analyses.

Code
# Summarize core model performance
model_summary = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest (Regression)', 'Random Forest (Classification)'],
    'R² / Accuracy': [r2_lr, r2_rf, accuracy_rf],
    'RMSE / F1 Score': [rmse_lr, rmse_rf, f1_rf]
})
print(model_summary)

# Visualization of model results
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Model Performance Comparison', 'Skills vs Salary Impact'),
    specs=[[{"type": "bar"}, {"type": "bar"}]]
)

# Model performance comparison
models = ['Linear Regression', 'Random Forest Regression', 'Random Forest Classification']
metrics = [r2_lr, r2_rf, accuracy_rf]

fig.add_trace(go.Bar(x=models, y=metrics, name="Performance"), row=1, col=1)

# Skills vs salary impact
top_skills_salary = skill_importance[:8]
fig.add_trace(go.Bar(x=[s[0] for s in top_skills_salary],
                    y=[s[1] for s in top_skills_salary], name="Salary Impact"), row=1, col=2)

fig.update_layout(
    height=450,
    showlegend=False,
    template="plotly_white",
    title={
        'text': "Core Model Results - BA/ML/DS Skills Analysis",
        'y': 0.98,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
    },
    margin=dict(t=80)
)
fig.show()
                            Model  R² / Accuracy  RMSE / F1 Score
0               Linear Regression       0.278032     37899.005358
1      Random Forest (Regression)       0.467166     32558.537199
2  Random Forest (Classification)       0.999513         0.998609

The three modeling approaches provide complementary insights into the job market. Clustering reveals natural job segments, regression quantifies salary drivers, and classification identifies role-defining skills. Together, these analyses offer actionable guidance for career planning.

9 Key Takeaways and Recommendations

9.1 Summary of Findings

The analysis of business analytics, data science and machine learning job postings reveals several important patterns:

  1. Role Distribution: Business Analytics dominates the job market with 10,831 positions (35% of dataset), compared to 3,226 ML roles (10%) and 2,877 DS roles (9%). In total, 12,821 jobs (42%) carry at least one of the ML, DS, or BA labels, and many carry more than one.

  2. Skill-Based Job Segmentation: Jobs cluster into 6 groups with varying role compositions:

  • Cluster 0 (583 jobs, $140K): High-skill hybrid (90% BA/ML/DS combined: 60% ML, 26% DS, 70% BA)
  • Cluster 1 (10,189 jobs, $145K): General technical roles (only 17% BA/ML/DS), highest pay
  • Cluster 2 (13,313 jobs, $101K): Entry-level (29% BA/ML/DS, 28% BA, 2.0 years experience)
  • Cluster 3 (6,573 jobs, $109K): BA-heavy (99% BA/ML/DS: 95% BA, 39% DS, 17% ML)
  • Cluster 4 (77 jobs, $140K): Pure ML specialists (100% BA/ML/DS: 100% ML)
  • Cluster 5 (73 jobs, $118K): Hybrid (96% BA/ML/DS: 55% ML, 32% DS, 93% BA, 56% remote)
  • Remote work availability varies from 25% to 56% across clusters

  3. Salary Drivers: Experience dominates (49% of feature importance), followed by remote work status (7%). Individual technical skills contribute 3-4% each: Data Analysis (4.3%), Tableau (3.7%), AWS (3.6%), SQL (3.5%), Statistics (3.0%), Python (3.0%). The R² of 0.47 shows that skills, experience, and remote status together explain about half of salary variation.

  4. Role Differentiation: ML/DS roles have distinct skill patterns, achieving 99.95% classification accuracy. Because the labels are derived from the same skill indicators used as features, this score mainly confirms that the role definitions are internally consistent; within those definitions, ML/DS positions require clearly different capabilities than BA or general analyst roles.
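The role shares quoted in the role distribution finding can be reproduced from the printed counts. The short check below uses only figures that appear in the output above.

```python
# Counts taken from the preprocessing output.
total = 30_808
counts = {
    "BA": 10_831,
    "ML": 3_226,
    "DS": 2_877,
    "any of BA/ML/DS": 12_821,
}

# Share of the filtered dataset, rounded to whole percentages.
shares = {label: round(100 * n / total) for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")

# The three labels sum to more than the union, so some postings carry several labels.
label_sum = counts["BA"] + counts["ML"] + counts["DS"]
extra_labels = label_sum - counts["any of BA/ML/DS"]
print(f"Label assignments beyond one per posting: {extra_labels:,}")
```

The individual shares add up to more than the 42% union because the labels overlap.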

9.2 Recommendations for Job Seekers

For Career Advancement:

  • Gain experience: it is the single biggest salary driver (49% of feature importance)
  • Consider remote status: it is the second-strongest salary predictor (7% of feature importance), though importance alone does not show whether it raises or lowers pay
  • Learn practical tools: Data analysis (4.3%), Tableau (3.7%), AWS (3.6%), SQL (3.5%), Statistics (3.0%), Python (3.0%)
  • General technical roles (Cluster 1) pay highest ($145K) and form the second-largest cluster (10,189 jobs)

For Business Analytics Path:

  • Highest volume opportunity: 10,831 BA roles identified (35% of job market)
  • Core BA skill combo: SQL + Tableau/Power BI + data visualization + data analysis (2+ skills required)
  • Best BA cluster: Cluster 3 (6,573 jobs at $109K) with 95% BA roles
  • Hybrid advantage: Many BA roles overlap with DS (39% in Cluster 3), so learning Python/statistics opens DS opportunities

For Transitioning to ML/Data Science:

  • ML path (3,226 roles): Most specialized and competitive, built around TensorFlow, PyTorch, Deep Learning, and NLP
  • DS path (2,877 roles): Requires R, or a combination of Python, Statistics, and DS tools (Pandas, NumPy, Scikit-learn)
  • Pure ML roles (Cluster 4): Only 77 jobs at $140K, highly specialized
  • The 99.95% classification accuracy shows these roles, as defined here, are marked by very specific skill combinations

For Maximizing Opportunities:

  • Highest pay with high volume: Cluster 1 (10,189 jobs at $145K), general technical roles, only 17% need BA/ML/DS
  • Entry-level and largest cluster: Cluster 2 (13,313 jobs at $101K), 29% BA/ML/DS, lowest experience requirement (2.0 years)
  • BA opportunities: Cluster 3 (6,573 jobs at $109K), 99% need BA/ML/DS (95% BA, 39% DS overlap)
  • Remote work: Cluster 5 (73 jobs at $118K, 56% remote), 96% hybrid BA/ML/DS roles
  • High-skill hybrid: Cluster 0 (583 jobs at $140K), 90% BA/ML/DS (60% ML + 70% BA combination)

9.3 Limitations and Considerations

  • The analysis is based on job posting data which may not reflect actual hiring outcomes
  • Skill requirements in job posts may differ from day-to-day job responsibilities
  • Market conditions and geographic factors also influence salaries beyond just skills
  • The models identify patterns but don’t capture all nuances of career success